COVID-19 Exploratory Data Analysis with pandas and interactive plots with Plotly

Step 1: Load the libraries and data

In [ ]:
import os
import numpy as np
import pandas as pd
In [2]:
data = pd.read_csv('COVID-19-Cases.csv')
data.head()
Out[2]:
Case_Type Cases Difference Date Country_Region Province_State Admin2 Combined_Key FIPS Lat Long Prep_Flow_Runtime Table_Names
0 Deaths 0.0 0.0 2/27/2020 Bahamas NaN NaN NaN NaN 25.0343 -77.3963 3/30/2020 Time Series
1 Confirmed 16.0 0.0 2/16/2020 Germany NaN NaN NaN NaN 51.0000 9.0000 3/30/2020 Time Series
2 Deaths 0.0 0.0 1/31/2020 Canada Alberta NaN NaN NaN 53.9333 -116.5765 3/30/2020 Time Series
3 Deaths 0.0 0.0 3/4/2020 Australia Queensland NaN NaN NaN -28.0167 153.4000 3/30/2020 Time Series
4 Confirmed 0.0 0.0 1/25/2020 Suriname NaN NaN NaN NaN 3.9193 -56.0278 3/30/2020 Time Series

The pandas read_csv function loads the CSV file into a DataFrame.

The head method displays the first 5 records of the DataFrame.

From these first 5 records it can already be seen that the dataset contains many missing values.

In [4]:
data.shape
Out[4]:
(92070, 13)

shape is a DataFrame attribute (not a function) that gives the number of rows and columns.

There are 92070 records and 13 fields in the dataset.
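The same check can be sketched on a tiny hand-made frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import pandas as pd

# Toy frame mirroring the dataset's layout (values are made up)
sample = pd.DataFrame({
    'Case_Type': ['Deaths', 'Confirmed', 'Deaths'],
    'Cases': [0.0, 16.0, 0.0],
    'Country_Region': ['Bahamas', 'Germany', 'Canada'],
})

# shape is an attribute, accessed without parentheses: (rows, columns)
rows, cols = sample.shape
print(rows, cols)
```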

Step 2: Data Summary

In [5]:
data.describe()
Out[5]:
Cases Difference FIPS Lat Long
count 91746.000000 85088.000000 50336.00000 91716.000000 91716.000000
mean 110.480184 9.627915 30431.79466 32.128978 -47.854246
std 2022.197499 160.640521 15212.33980 17.639699 72.588082
min 0.000000 -668.000000 1001.00000 -41.454500 -170.132000
25% 0.000000 0.000000 18179.00000 30.057200 -93.985197
50% 0.000000 0.000000 29180.00000 36.932559 -82.427002
75% 1.000000 0.000000 45087.00000 41.772551 1.659600
max 101739.000000 14840.000000 78000.00000 71.706900 178.065000

The describe method displays summary statistics of a DataFrame such as the mean, standard deviation, quantiles, minimum and maximum. By default it only includes columns with numeric datatypes.

Notice that the minimum number of cases is zero. The 75th percentile is 1, meaning 75% of records report at most 1 case, while the maximum number of cases is 101739.
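The quantile rows of describe can also be computed directly with the quantile method; a sketch on made-up case counts (not the real data), skewed the same way as the real column, with mostly zeros plus a large outlier:

```python
import pandas as pd

# Made-up case counts: mostly zeros plus outliers, like the real column
cases = pd.Series([0.0, 0.0, 0.0, 1.0, 5.0, 100.0])

# quantile reproduces the 25% / 50% / 75% rows of describe()
q = cases.quantile([0.25, 0.5, 0.75])
print(q)
print(cases.max())
```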

In [6]:
data.select_dtypes('object').describe()
Out[6]:
Case_Type Date Country_Region Province_State Admin2 Combined_Key Prep_Flow_Runtime Table_Names
count 92070 92070 92070 68196 50928 51008 92070 92070
unique 2 69 177 132 1845 3188 1 2
top Deaths 3/23/2020 US Texas Unassigned Cherry, Nebraska, US 3/30/2020 Daily Summary
freq 46035 6986 57708 4188 576 16 92070 51120

There are 2 case types, 177 country regions, 132 province states and 1845 Admin2 values (Admin2 refers to US counties).

Province state data is available only for the following country regions: Australia, Canada, China, Denmark, France, Netherlands, United Kingdom, United States.

Admin2 and Combined_Key data is available only for the US.
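Claims like these can be verified with a boolean mask over the non-null rows; a sketch on a toy frame (the values are illustrative, not from the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    'Country_Region': ['Bahamas', 'Canada', 'Australia', 'US'],
    'Province_State': [None, 'Alberta', 'Queensland', 'Texas'],
    'Admin2': [None, None, None, 'Harris'],
})

# Country regions that carry Province_State data
with_states = sorted(sample.loc[sample['Province_State'].notna(),
                                'Country_Region'].unique())
print(with_states)

# Country regions that carry Admin2 (county) data
with_admin2 = sorted(sample.loc[sample['Admin2'].notna(),
                                'Country_Region'].unique())
print(with_admin2)
```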

In [18]:
print("Different case types are ",list(data['Case_Type'].unique()))
Different case types are  ['Deaths', 'Confirmed']
In [15]:
data['Date'] = pd.to_datetime(data['Date'])
print("Data is collected from ",min(data['Date'])," to ",max(data['Date']))
Data is collected from  2020-01-22 00:00:00  to  2020-03-30 00:00:00
In [19]:
print("Different Table Names from which data is collected ",list(data['Table_Names'].unique()))
Different Table Names from which data is collected  ['Time Series', 'Daily Summary']

Step 3: Handling Missing Values

In [20]:
data.isnull().sum()
Out[20]:
Case_Type                0
Cases                  324
Difference            6982
Date                     0
Country_Region           0
Province_State       23874
Admin2               41142
Combined_Key         41062
FIPS                 41734
Lat                    354
Long                   354
Prep_Flow_Runtime        0
Table_Names              0
dtype: int64

Province_State, Admin2, Combined_Key and FIPS have missing values because these fields are not collected for every country region.

The number of cases is missing in 324 records, and latitude and longitude information is missing in 354 records. Let's explore the data further to see whether we can learn why these values are missing.

In [22]:
missing_cases_data = data[data['Cases'].isnull()]
missing_cases_data.head()
Out[22]:
Case_Type Cases Difference Date Country_Region Province_State Admin2 Combined_Key FIPS Lat Long Prep_Flow_Runtime Table_Names
33895 Confirmed NaN 44.0 2020-03-23 US Wisconsin NaN NaN NaN 43.0186 -92.3814 3/30/2020 Daily Summary
33928 Confirmed NaN 76.0 2020-03-23 US Utah NaN NaN NaN 40.1500 -112.0953 3/30/2020 Daily Summary
33984 Confirmed NaN 20.0 2020-03-23 US Kansas NaN NaN NaN 38.5266 -96.7265 3/30/2020 Daily Summary
34049 Confirmed NaN 109.0 2020-03-23 US Tennessee NaN NaN NaN 35.1269 -89.9253 3/30/2020 Daily Summary
34111 Confirmed NaN 2.0 2020-03-23 US Wyoming NaN NaN NaN 42.7560 -107.3025 3/30/2020 Daily Summary
In [28]:
print(missing_cases_data['Date'].unique())
print(missing_cases_data['Country_Region'].unique())
print(missing_cases_data['Table_Names'].unique())
['2020-03-23T00:00:00.000000000' '2020-03-30T00:00:00.000000000'
 '2020-03-24T00:00:00.000000000' '2020-03-27T00:00:00.000000000'
 '2020-03-25T00:00:00.000000000' '2020-03-28T00:00:00.000000000'
 '2020-03-29T00:00:00.000000000' '2020-03-26T00:00:00.000000000']
['US']
['Daily Summary']

The missing case counts all come from the US Daily Summary data, from 23rd March to 30th March.

In [30]:
missing_lat_data = data[data['Lat'].isnull()]
missing_lat_data.head()
Out[30]:
Case_Type Cases Difference Date Country_Region Province_State Admin2 Combined_Key FIPS Lat Long Prep_Flow_Runtime Table_Names
227 Confirmed 705.0 0.0 2020-03-01 Cruise Ship NaN NaN NaN NaN NaN NaN 3/30/2020 Time Series
298 Deaths 0.0 0.0 2020-02-04 Cruise Ship NaN NaN NaN NaN NaN NaN 3/30/2020 Time Series
566 Confirmed 0.0 0.0 2020-02-02 Cruise Ship NaN NaN NaN NaN NaN NaN 3/30/2020 Time Series
637 Confirmed 691.0 0.0 2020-02-25 Cruise Ship NaN NaN NaN NaN NaN NaN 3/30/2020 Time Series
1205 Deaths 3.0 1.0 2020-02-23 Cruise Ship NaN NaN NaN NaN NaN NaN 3/30/2020 Time Series
In [32]:
print(missing_lat_data['Country_Region'].unique())
['Cruise Ship' 'US']
In [35]:
missing_lat_data['Country_Region'].value_counts()
Out[35]:
US             216
Cruise Ship    138
Name: Country_Region, dtype: int64
In [56]:
data[(data['Country_Region'] == 'US')|(data['Country_Region'] == 'Cruise Ship')]['Country_Region'].value_counts()
Out[56]:
US             57708
Cruise Ship      138
Name: Country_Region, dtype: int64

Since Cruise Ship is not a geographic region, all 138 of its records have missing latitude and longitude values. In addition, 216 US records have no latitude and longitude information. Let's explore further to understand why this data is missing.

In [83]:
print(missing_lat_data[missing_lat_data['Country_Region'] == 'US']['Table_Names'].unique())
print(missing_lat_data[missing_lat_data['Country_Region'] == 'US']['Date'].unique())
print(missing_lat_data[missing_lat_data['Country_Region'] == 'US']['Cases'].unique())
['Daily Summary']
['2020-03-23T00:00:00.000000000' '2020-03-30T00:00:00.000000000'
 '2020-03-24T00:00:00.000000000' '2020-03-27T00:00:00.000000000'
 '2020-03-25T00:00:00.000000000' '2020-03-28T00:00:00.000000000'
 '2020-03-29T00:00:00.000000000' '2020-03-26T00:00:00.000000000']
[nan]

The US records with missing location information are Daily Summary records from 23rd March to 30th March, the same records whose case counts are missing.

For the analysis below we will simply ignore the missing data.
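Ignoring missing data typically means dropping the affected rows with dropna on the relevant columns; a sketch with toy values (the column names follow the dataset, the values are made up):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Cases': [5.0, np.nan, 2.0],
    'Lat': [25.0, np.nan, np.nan],
    'Country_Region': ['Bahamas', 'US', 'Germany'],
})

# Keep only records that still have a case count before aggregating
with_cases = sample.dropna(subset=['Cases'])
print(len(with_cases))

# Keep only records with coordinates before plotting on a map
with_coords = sample.dropna(subset=['Lat'])
print(len(with_coords))
```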

Step4: Plots

In [87]:
data.head()
Out[87]:
Case_Type Cases Difference Date Country_Region Province_State Admin2 Combined_Key FIPS Lat Long Prep_Flow_Runtime Table_Names
0 Deaths 0.0 0.0 2020-02-27 Bahamas NaN NaN NaN NaN 25.0343 -77.3963 3/30/2020 Time Series
1 Confirmed 16.0 0.0 2020-02-16 Germany NaN NaN NaN NaN 51.0000 9.0000 3/30/2020 Time Series
2 Deaths 0.0 0.0 2020-01-31 Canada Alberta NaN NaN NaN 53.9333 -116.5765 3/30/2020 Time Series
3 Deaths 0.0 0.0 2020-03-04 Australia Queensland NaN NaN NaN -28.0167 153.4000 3/30/2020 Time Series
4 Confirmed 0.0 0.0 2020-01-25 Suriname NaN NaN NaN NaN 3.9193 -56.0278 3/30/2020 Time Series
In [88]:
import plotly.express as px
fig = px.pie(data,names='Case_Type', title='Number of cases of each type')
fig.show()

Half of the records are death counts and half are confirmed-case counts. This is because, for each country region, both deaths and confirmed cases are recorded every day.
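The 50/50 split can also be read off value_counts without a plot; a sketch on a toy frame:

```python
import pandas as pd

# Every region contributes one Deaths row and one Confirmed row per day
sample = pd.DataFrame({
    'Case_Type': ['Deaths', 'Confirmed'] * 4,
})

# normalize=True returns the share of each case type
share = sample['Case_Type'].value_counts(normalize=True)
print(share)
```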

In [93]:
fig = px.pie(data,names='Country_Region', title='Number of records for each region')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()

62.7% of the records contain US data, mainly because US figures are also broken down by state and county. A fairer way to compare the number of data points per country is to count the unique dates reported for each country.
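That comparison can be sketched with groupby plus nunique, assuming the Date column has already been parsed with to_datetime as above (toy data, not the real dataset):

```python
import pandas as pd

# US has several rows per day (state/county breakdown); Germany has one
sample = pd.DataFrame({
    'Country_Region': ['US', 'US', 'US', 'Germany', 'Germany'],
    'Date': pd.to_datetime(['2020-03-23', '2020-03-23', '2020-03-24',
                            '2020-03-23', '2020-03-24']),
})

# Unique reporting days per country: both countries cover the same 2 days,
# even though US contributes more raw rows
days = sample.groupby('Country_Region')['Date'].nunique()
print(days)
```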
